Exploring an enriched Iowa liquor sales dataset

by Alex Spanos

Introduction

Welcome to my Exploratory Data Analysis project for Udacity’s Data Analyst Nanodegree. As I wanted my work to be somewhat more original, I forewent the option of analysing one of the datasets in the Udacity list and went on to spend a few weeks trying to locate an appropriate dataset online; appropriate in this context meant firstly satisfying Udacity prerequisites and secondly being a real-world recent dataset, relatively “virgin” with respect to analysis attempts.

There are a few excellent directories of free datasets out there, as well as free APIs. The one that intrigued me the most though was the Iowa Liquor Sales dataset. This dataset contains transactions between the relevant Iowa state authority and liquor stores across the state with a very high level of detail. I was surprised to find out that in Iowa alcohol sales are state-controlled; it seems that all orders have to go through the state! Anyway, the reasons why I found this dataset interesting were the following:

The final point was quite important for me, because what mostly intrigues me in this field is discovering relationships and patterns which are not directly obvious (well I guess that applies for most of us here).

So I then spent a bit of time trying to think of alternative data that I could merge into the master (raw) dataset. Given the richly detailed data at zipcode level, I managed to pull from the internet the following zipcode level information:

I did not know beforehand whether this would definitely provide interesting insights with regards to liquor consumption, but at least I could supplement my project with some descriptives about Iowa (which I knew next to nothing about!). Additionally, it was greatly enjoyable to exercise the skills I learnt in the Nanodegree’s Data Wrangling course. I got to use a few API’s, scraped some tables, used some regex and more in order to get all my auxiliary data sets in a neat csv format. Here I cheated a bit and used Python. After this data was in csv format though I performed all subsequent data merging operations in R.

For the analysis purposes of this project I used RStudio on my Windows 7 laptop and also used git and GitHub for version control.

So, on to the data analysis.

Data

Loading

Please see relevant scripts * here.

Dataset features

Initial univariate exploration

The variables in the dataset are listed below:

##  [1] "ZIPCODE"         "DATE"            "STORE"          
##  [4] "NAME"            "ADDRESS"         "CITY"           
##  [7] "STORE LOCATION"  "COUNTY NUMBER"   "CATEGORY"       
## [10] "CATEGORY NAME"   "VENDOR NO"       "VENDOR"         
## [13] "DESCRIPTION"     "PACK"            "LITER SIZE"     
## [16] "STATE BTL COST"  "BTL PRICE"       "BOTTLE QTY"     
## [19] "TOTAL"           "TOTAL_EST"       "POPULATION"     
## [22] "PRCP"            "SNOW"            "TMIN"           
## [25] "TMAX"            "Perc_arab"       "Perc_asian"     
## [28] "Perc_black"      "Perc_chinese"    "Perc_dutch"     
## [31] "Perc_english"    "Perc_french"     "Perc_german"    
## [34] "Perc_greek"      "Perc_hispanic"   "Perc_indian"    
## [37] "Perc_irish"      "Perc_italian"    "Perc_japanese"  
## [40] "Perc_mexican"    "Perc_native"     "Perc_polish"    
## [43] "Perc_russian"    "Perc_scottish"   "Perc_white"     
## [46] "BOTTLE_QTY_NORM" "TOTAL_EST_NORM"  "STORE_LAT"      
## [49] "STORE_LON"

Columns from the original liquor sales data are 1-18; others come from the auxiliary sources.

An example row in the dataset can be seen below.

##    ZIPCODE       DATE STORE                 NAME      ADDRESS  CITY
## 1:   50002 2014-12-08  4417 KUM & GO #76 / ADAIR 109 S 5TH ST ADAIR
##                                                         STORE LOCATION
## 1: 109 S 5TH ST\nADAIR 50002\n(41.49807185700007, -94.643468852999945)
##    COUNTY NUMBER CATEGORY     CATEGORY NAME VENDOR NO
## 1:             1  1012100 CANADIAN WHISKIES       115
##                              VENDOR  DESCRIPTION PACK LITER SIZE
## 1: Constellation Wine Company, Inc. Black Velvet   12        750
##    STATE BTL COST BTL PRICE BOTTLE QTY TOTAL TOTAL_EST POPULATION PRCP
## 1:           5.23      7.85         12  94.2        22       1297   NA
##    SNOW TMIN TMAX Perc_arab Perc_asian Perc_black Perc_chinese Perc_dutch
## 1:   NA   NA   NA         0       0.37          0            0       1.46
##    Perc_english Perc_french Perc_german Perc_greek Perc_hispanic
## 1:        11.05        1.38       41.13          0          1.04
##    Perc_indian Perc_irish Perc_italian Perc_japanese Perc_mexican
## 1:        0.22       9.75         0.16          0.07         0.89
##    Perc_native Perc_polish Perc_russian Perc_scottish Perc_white
## 1:        0.07         0.4            0          1.05       98.8
##    BOTTLE_QTY_NORM TOTAL_EST_NORM STORE_LAT STORE_LON
## 1:      0.00925212     0.01696222  41.49807 -94.64347

The dataset is quite rich and holds many features of potential interest. For the purposes of this project I opted to focus on the drinking patterns of Iowans with respect to time of year and locality.

As an initial attempt at producing descriptives for the dataset, I am defining below three functions that generate histograms and calculate means and number of observations. I have split the variables into three sets; the original liquor sales variables, the added meteorological variables and the added demographic variables. For the latter two categories, the statistics were obtained on data grouped by zipcode in order to be more correct.

For the following plots I used multiplot; a very useful function from cookbook-r.

The liquor sales variables in this case are number of bottles per ordered pack, volume of bottle in millilitres, the cost of the bottle for the state, the price it was sold to the merchant for, the total number of bottles ordered and the total money spent per order for the merchant. Let’s look at their distributions.

It is not really clear what’s going on so I am going to log-transform the x-axis of the cost-related variables and zoom-in closer for “PACK”, “LITER SIZE” and “BOTTLE QTY”.

So now we can see that distributions of the cost-related have transformed to bell-shaped or normal-like. We can also clearly see the discrete nature of the other variables.

Now, let’s see descriptives of the meteorological variables; again these were derived after grouping by zipcode-day pairs.

For precipitation and snow it’s hard to discern what’s going on so I will log-transform x-axes; for the temperatures I will just make the bins finer. Here are the plots:

I would like to note that in general I am a bit disappointed by the completeness of the meteorological data that I ordered from the National Climatic Data Center; I also don’t understand the units used for temperatures, as any transformations I tried did not return reasonable planet Earth values..

The above plots are a bit hard to interpret as they are, because zipcodes are not represented uniformly; e.g. zipcode x may have meteorological readings for all days whereas others just for 5 days. If one desires to perform some inference on Iowa weather, the simplest way would be to generate the same plots for individual zipcodes.

Now onto the final partition of variables, the demographics. These include ethnic groups/nationalities percentages and total drinking establishments - grouped by zipcode.

From plots 14, 15 and 32 I guess it is clear that Iowa’s population is predominantly white. Without knowing too much about US geodemographics, presumably this is a “heartland” characteristic?

Otherwise I am surprised to see the magnitude of the German population - almost 50% on average! It can’t be true that half the Iowa population is German people, so I guess these categories pertain to “heritage” in general. Which means almost half of the Iowans are of German descent.

Also common is English, Scottish, French and Dutch heritage. In these histograms, one can see that the black and Hispanic and white group distributions exhibit long-tails, implying that there exist specific areas where these populations are much more (or less) prevalent than generally encountered in Iowa.

The “outlier” plot category in this panel is the total number of drinking establishments per zipcode - included here for thoroughness purposes.

In addition to the above, I think it would be useful to show the most popular categories of liquors and individual types.

The dataset contains 69 different categories, so it will be hard to visualise all of them in a plot. In this case just showing a table is also informative.

##                         CATEGORY NAME bottles      price
## 1                      80 PROOF VODKA 5034321   9.175411
## 2                   CANADIAN WHISKIES 3063617  13.687787
## 3                          SPICED RUM 1781396  14.380130
## 4                     WHISKEY LIQUEUR 1092220  16.440005
## 5                             TEQUILA 1080224  20.358524
## 6                    BLENDED WHISKIES 1069756  10.400273
## 7           STRAIGHT BOURBON WHISKIES 1067546  17.220917
## 8                      IMPORTED VODKA 1008599  20.268337
## 9    PUERTO RICO & VIRGIN ISLANDS RUM  992118  11.405754
## 10                     FLAVORED VODKA  966793  10.935803
## 11            AMERICAN GRAPE BRANDIES  689098   8.059325
## 12                 TENNESSEE WHISKIES  688586  20.747060
## 13                  AMERICAN DRY GINS  613832   9.085220
## 14                       FLAVORED RUM  538311  12.678766
## 15                 AMERICAN COCKTAILS  534658  10.689236
## 16 MISC. IMPORTED CORDIALS & LIQUEURS  487998  21.449140
## 17              IMPORTED VODKA - MISC  474696  16.909476
## 18                     CREAM LIQUEURS  431367  17.743147
## 19            IMPORTED GRAPE BRANDIES  372098  21.058585
## 20                  IMPORTED SCHNAPPS  349139  16.437759
## 21                    SCOTCH WHISKIES  301218  23.689421
## 22 MISC. AMERICAN CORDIALS & LIQUEURS  253863  11.564033
## 23        DISTILLED SPIRITS SPECIALTY  225048  15.317123
## 24                         TRIPLE SEC  224300   3.974883
## 25     DECANTERS & SPECIALTY PACKAGES  216893  24.330833
## 26                PEPPERMINT SCHNAPPS  213255   7.174323
## 27                  IMPORTED DRY GINS  212966  22.151287
## 28                     IRISH WHISKIES  208518  25.281346
## 29                     PEACH SCHNAPPS  151208   9.646169
## 30                    COFFEE LIQUEURS  136482  15.525798
## 31              STRAIGHT RYE WHISKIES  126953  26.339466
## 32                BLACKBERRY BRANDIES  121103   9.163894
## 33                  AMERICAN AMARETTO  119461   6.739355
## 34                 SINGLE MALT SCOTCH   87382  43.527749
## 35                     APPLE SCHNAPPS   65761  10.148421
## 36                   AMERICAN ALCOHOL   57573  13.408440
## 37              BUTTERSCOTCH SCHNAPPS   57166   9.235604
## 38                  CINNAMON SCHNAPPS   48640  11.526123
## 39                   APRICOT BRANDIES   47441   9.035857
## 40                WATERMELON SCHNAPPS   39578  10.514740
## 41             MISCELLANEOUS SCHNAPPS   36833  12.033186
## 42                     GRAPE SCHNAPPS   34779  10.654928
## 43                  IMPORTED AMARETTO   25090  20.526154
## 44                       BARBADOS RUM   24914  15.059416
## 45                 ROOT BEER SCHNAPPS   24296   9.322246
## 46                     PEACH BRANDIES   23588   7.696805
## 47                    CHERRY BRANDIES   21856   7.767895
## 48                    100 PROOF VODKA   19816  14.089263
## 49                        JAMAICA RUM   19799  16.038717
## 50                STRAWBERRY SCHNAPPS   19578   7.720291
## 51                 RASPBERRY SCHNAPPS   17918   8.544250
## 52                       FLAVORED GIN   17434  10.538608
## 53            TROPICAL FRUIT SCHNAPPS   14127   7.256225
## 54              GREEN CREME DE MENTHE   11782   6.934776
## 55     SINGLE BARREL BOURBON WHISKIES    9574  34.392213
## 56                 AMERICAN SLOE GINS    8731   6.828953
## 57               WHITE CREME DE CACAO    8543   6.967944
## 58                DARK CREME DE CACAO    8263   6.967815
## 59                    LOW PROOF VODKA    5690  12.502111
## 60             MISCELLANEOUS BRANDIES    4580  24.599805
## 61                 SPEARMINT SCHNAPPS    4313   7.203551
## 62                  OTHER PROOF VODKA    3876  11.962781
## 63                         ROCK & RYE    3611  10.966809
## 64            BOTTLED IN BOND BOURBON    3070  14.623684
## 65              WHITE CREME DE MENTHE    2839   6.999466
## 66                                       1755  33.695673
## 67                    CREME DE ALMOND    1710   7.001005
## 68                           ANISETTE    1491   6.927277
## 69                    HIGH PROOF BEER      38 109.174286

We see that 80 proof vodka is the most popular liquor type in Iowa by some margin. Canadian whiskies are also very popular. Also popular are rums, cheap whiskies and tequila. There also exists a category without a name which has a relatively high average price.

On the contrary, the least popular liquors are high proof beer (only 38 bottles sold - a categorisation outlier? A one-off order?) and liqueurs like Anisette (I imagine this is like ouzo), crème liqueurs, and spearmint schnapps (peppermint schnapps on the other hand is quite popular - both must taste like cough medicine!).

Let’s see the most and least popular individual liquors.

##                              DESCRIPTION bottles     price
## 1                           Black Velvet 1422405 10.191468
## 2                          Hawkeye Vodka 1036303  7.313998
## 3              Captain Morgan Spiced Rum  656577 17.603547
## 4              Fireball Cinnamon Whiskey  572895 15.620919
## 5          Jack Daniels Old #7 Black Lbl  473872 25.320187
## 6                           Five O'clock  466019  7.092719
## 7                           Barton Vodka  398614  8.211223
## 8                    Mccormick Vodka Pet  371391  6.719349
## 9                  Smirnoff Vodka 80 Prf  365132 11.796400
## 10          Absolut Swedish Vodka 80 Prf  320092 19.418173
## 11           Crown Royal Canadian Whisky  318719 25.130260
## 12                        Phillips Vodka  307152  9.240963
## 13                  Bacardi Superior Rum  301426 13.453927
## 14           Seagrams 7 Crown Bl Whiskey  300828 12.463819
## 15                  Jagermeister Liqueur  287670 17.999875
## 16                       Mccormick Vodka  278271  3.637401
## 17       Paul Masson Grande Amber Brandy  275110  5.361728
## 18                              Jim Beam  269741 16.211123
## 19             Admiral Nelson Spiced Rum  264521 12.053835
## 20                             Five Star  260876  7.090375
## 21 Jose Cuervo Especial Reposado Tequila  242899 16.597364
## 22                   Paramount White Rum  239648  9.024818
## 23                   Canadian Ltd Whisky  235012  9.412726
## 24                      Southern Comfort  229259 12.327208
## 25               Seagram's Extra Dry Gin  223595  8.402159
## 26        Captain Morgan Original Spiced  223252 13.245653
## 27                    Five O'clock Vodka  220975  7.155968
## 28                      Grey Goose Vodka  219106 24.536126
## 29                    Malibu Coconut Rum  215779 14.376951
## 30             Uv Blue (raspberry) Vodka  194191 11.437719

Surprisingly I’ve never noticed most of these spirits at bars in Europe. Altogether though this list was what I expected: vodkas, whiskeys and rums and there are a few global household names in there such as Captain Morgan, Jack Daniels, Smirnoff, Jagermeister, etc.. One thing is for sure though.. Iowans love vodka and particularly the cheaper local variety! I’m also a bit curious regarding the popularity of cinnamon whiskey. Maybe a success with the student population?

Let’s see the top 30 least popular items now.

##                                     DESCRIPTION bottles   price
## 1                             Calvados Morin XO       1   70.50
## 2                 Framboise (Raspberry Liqueur)       1   45.89
## 3                              Glendronach 18yo       1  109.88
## 4                John J Bowman Str. Bbn.Why. HA       1   23.40
## 5                    Calvados Morin Hors d' Age       2   54.00
## 6              Cedar Ridge Barrel Proof Bourbon       2 8700.00
## 7                                Espolon Blanco       2   20.13
## 8                    Fernandaez Brandy De Jerez       2   54.96
## 9             Glendronach 14yr Virgin Oak - DNO       2   69.30
## 10                    Grappa di Recioto Amarone       2  117.36
## 11          Hirsch Straight Bourbon Small Batch       2   33.00
## 12                              Jameson 12YR HA       2   44.99
## 13                       Johnnie Walker Odyssey       2  747.96
## 14 Macallan Sherry Cask 25 Year Old Scotch - HA       2  845.10
## 15                    New Amsterdam Citron Mini       2    6.93
## 16                            Sam Adams Utopias       2  118.96
## 17   Seagram's Seven Crown American Spiced Mini       2    9.26
## 18                        Taaka King Cake Vodka       2    6.00
## 19                Tooters 50ml Marg, OTB, Mango       2   65.25
## 20                                  Toucano Rum       2   26.04
## 21                        Uncle Bob's Root Beer       2   15.38
## 22                  1800 Ultimate Blueberry RTD       3   14.25
## 23                            Glenmorangie 25yr       3  431.99
## 24               Hiram Walker Blue Curacao 3pak       3    5.85
## 25              Johnnie Walker King George V HA       3  444.98
## 26          Libertine Absinthe(french Absinthe)       3   64.13
## 27        Seagram's Seven Crown American Spiced       3   10.50
## 28        Admiral Nelson Cherry Spiced Rum Mini       4    5.66
## 29                            Ciroc Luxury Mini       4   44.93
## 30                    Irishman Founders Reserve       4   25.01

All items in the top 30 least popular list are in a sense outliers as at most 4 bottles have been ordered altogether. These probably correspond to single orders, and it is likely that some kind of data recording inconsistency has taken place (e.g. product description entered somehow differently).

Interestingly however, a few rather expensive liquors are included in this list - mainly bourbons and whiskeys. For example, two orders were placed for “Cedar Ridge Barrel Proof Bourbon” - with a bottle price of 8700 (!). It turns out that two separate orders were placed on 27/10 and 29/10 by two separate “Hy-Vee Food” stores in different zipcodes in Cedar Rapids. Interesting - anyone’s guess really as to the background story. It has to be a special order for a specific customer (maybe they broke the first bottle? An expensive mistake!).

Initial multivariate exploration

In this part I intended to show some exploratory scatter plots between variables in order to visually identify any stand-out associations, by making use of the great ggpairs function of GGally. There are a couple of issues though. For one, the number of observations is quite large (~2.5 m) and also, the dataset contains 49 variables. There is no way ggpairs will produce a sensible display so we need to be a bit clever and restrict what we feed it.

As a first attempt, let’s look at associations between proportions of some national heritage group proportions.

This one is fairly easy to display, as observations are unique per zipcode (thus restricting the number of observations to about 350 - the number of unique zipcodes) and I have included 7 relevant variables in the data. Let’s see what we get.

The display is a bit compressed but we can still observe positive linear correlations between Hispanic and Mexican, English and Scottish, English and Irish, French and Irish. We can also spot a few negative correlations such as German and English, German and French, black and white, Asian and white. I think these displays aren’t very informative.

Next, let’s look at scatter plots between the main variables of interest in the original liquor sales data. I will perform the analysis on a reasonably sized subset of the data by restricting to a single zipcode (50314 - the highest populated zipcode in Des Moines).

So the raw output here is not very informative and in order to see more one has to adjust axes, convert scales, etc. This will take place for variables of interest later on in this work. But a quick scan through the plots shows a very strong linear correlation between state buying and selling price of items; this implies a fixed profit (in terms of percentage) is set by the state. Also correlated, as expected, are bottle quantities sold per transaction and total money spent per transaction.

I spent some time trying some variable groupings with ddply in order to find some interesting combinations to input to ggpairs but was unable to produce any particularly interesting visualisations at this preliminary exploratory stage. I think that due to plot size restrictions, ggpairs is not the best tool to use for a dataset of this magnitude in terms of number of observations and variables. I also found it was quite hard to find a way to adjust font sizes in the various sections of the output - especially on the diagonal.

Analysis

Demographics

In the following plots we will perform a high-level analysis of basic demographics.

Firstly let’s look at population per zipcode, so we need to group accordingly.

So now we’ve grouped per zipcode and pulled the population.

I realised here that I had not actually merged the coordinates of “zipcode centre” to the master. I got around this by calculating the mean latitude and longitude of stores in the zipcode. Below lies a the map of Iowa where we overlay population bubbles per zipcode and also the location of universities; here the triangle size indicates number of students enrolled.

We can see that Davenport and Iowa city are big University cities. Regarding population, Des Moines, Waterloo, Cedar Rapids, Dubuque and Sioux City are all cities in which certain areas (zipcodes) are highly populated. It will be more useful of course to look at overall city population and that is what we are going to do now.

And county populations below..

I think it’s clear that the centre of Iowa is Des Moines. There are some highly populated areas along the county’s contour, such as Sioux City, Omaha and Dubuque.

Before moving on to the liquor sales analysis I would like to take a look at national group populations.

In the composite plot below I am producing something like a heat map for population percentages per zipcode for a selection of nationalities; “cold” is blue and “hot” is yellow/brown (I found it hard to overlay a global legend.)

And finally

OK now let’s move on to liquor sales.

Liquor sales

Auditing data

Let’s audit the liquor sales data a little bit first just to make sure it is trustworthy.

And on to a box plot which illustrates the monthly distribution of the number of transactions recorded in the database; I also incorporated information for the number of stores appearing in the data on a monthly basis as a colour variable.

We see a drastic reduction in the number of reports; a step change occurring in June. The subsequent plots presenting the same data grouped by week.

Here we see that the step change occurs on the 1st week of June. Which is suspicious. Let’s zoom in on daily data for that period.

So it’s actually the first day of June that the number of reports reduces dramatically. I performed a quick Google search on whether anything changed drastically in terms of Iowa liquor sales legislation, or if anything major happened but to no avail. Furthermore, I believe this is not a seasonal effect (end of Univ. semesters, basketball/football end of season) because the number of reports:

  • Falls too abruptly
  • Does not return to previous levels later in the year

Therefore I feel that this pattern has something to do with data collection/recording.

One thing I observed was that the fiscal year for the relevant Iowa state division runs from July to June. So this dataset partially contains data from two fiscal years (January to June: 2013, July to December: 2014). Could this mean something?

Anyway, I decided not to pursue this investigation any further; it is clear that the dataset has limitations and it is probably best to analyse these two periods independently (at least for time series analysis).

Before I move on to the main body of the analysis, let’s look at the distribution of transactions per weekday.

Here we see that most orders are placed on Wednesdays and Thursdays.

Analysis of liquor sales data

We saw that time series analyses on the full data-set are probably not suitable.

Let’s look at various dataset properties.

Plotting the distribution of bottles sold per transaction.

Not easy to see shape of distribution. Looking at log-transform (x-axis values removed as they are log-transformed).

The log transform shows that the amount of bottles sold per transaction is to a large extent “quantised” (or non-uniform?). Let’s focus on the denser area of the distribution with un-transformed axes.

So here we see that the mode of the distribution is 12; most orders are submitted for a dozen bottles of some liquor. We also see that multiples of 12 are also high occurring, as well as 6 (half-a-dozen).

Now I would like to look at total bottle sales as time-series, so let’s create appropriate variables for the period of January to May.

The created data frame contains three variables: date, total bottles sold and normalised bottles sold. The latter variable is an attempt to adjust for the population bias. This effect arises because we aggregate data from variably populated areas in Iowa and can potentially obscure meaningful relationships in the analysis of the data.

And the plot follows..

When we look at the raw time series of the quantity of bottles ordered, we see some type of seasonality but it is not easy to interpret. However, once we attenuate the population effect, we see a very stable weekly pattern spiking on a single weekday. Let’s look at this data in more detail.

So in the weekday plot we see more clearly this stable pattern in quantities of bottles ordered per citizen. Tuesday is the “spike” day, Monday and Wednesday are pretty much the same; after Wednesday the quantities decline. In my mind this pattern makes sense: Monday is the first day of the week and the stores calculate their weekly requirements by the end of the day. They then submit the order either at end of business or early Tuesday; this allows time for the bottles to be delivered prior to the weekend. They shouldn’t need to order too many bottles after the early week’s budgeting.

It is a bit counter-intuitive though that in a previous plot we saw that the number of orders per weekday (as opposed to quantity of bottles ordered per weekday) was greatest on Wednesday’s and Thursday’s. Maybe if I had corrected for the population bias, a similar trend would have been revealed…

One last note on the previous weekday analysis. Let’s plot the raw and population-adjusted quantities against each other and see what we get. I hope it will show a bit more on how the normalisation helped the analysis.

I like this plot because it looks like a machine learning exercise.. We see here that each day forms an (almost) discrete cluster, with Tuesdays and Fridays having maximum separation from the main body of observations; however Fridays would belong to a high R-squared fitter regression line of all observations excluding Tuesdays. So I think the real outlier here is Tuesdays, in the sense that the normalisation added high information value for that weekday.

Bottle price vs total sales

In this small section I will group data per item (liquor) and plot the relationship between number of bottles ordered and bottle price.

We’ve already generated the relevant data frame gr_item.

Here is the raw plot…

We don’t see too much. In these cases it is often helpful to log-transform axes.

Let’s focus on the main body of observations. The function that adds the annotation to the following plot was taken from stackoverflow..

There is a a weak association (R-squared 0.13) between bottle price and quantities ordered. Therefore, from the plots above one cannot detect a simple function that links price and demand, however it is obvious that the relationship is inverse.

Relationships between order quantities of different liquors

In this small section I would like to look at order quantities as time series (note: using full dataset for this part). Let’s calculate our data frame first, where we group by month and liquor category, calculate total bottles ordered per category/month and derive from this a relevant normalised variable representing percentages over total bottles ordered.

In the subsequent plots I show time series of normalised bottle quantities of orders for each liquor category per month; I have added a colour overlay of season (winter, spring,…) in order to discern seasonal patterns more easily.

A few (qualitative) observations can be made. We can identify that certain liquor types have a demand curve that only peaks in December, for example liqueurs with various flavours like crème de menthe/cacao/etc., brandies and some types of schnapps (cinnamon, butterscotch). Given that these types of liquors have a heavier taste and are generally considered “warming”, their demand curve makes sense - we identified the “winter drinks”.

Another observation that makes sense, is the very similar curve shape of tequila and triple sec; two of the margarita’s ingredients (on the side I plotted respective bottle quantities to see if the 2:1 margarita ratio is reflected in bottle quantities - it didn’t). This demand pattern is there for a few of the liquors (flavoured rums, imported vodka), where bottle quantities are clearly correlated with expected average temperatures (demand lowest in winter, climbs gradually in spring, peaks between mid-spring/mid-summer, gradually descends in autumn). As opposed to the previous category, these are the “summer drinks”.

There is a lot of work that can still be performed here with good potential for hidden insights. Just by looking at the time series we can identify interesting patterns which don’t make immediate sense. For example:

  • Spiced rum peaks every three months at the beginning of each season
  • Whiskies and vodkas in general have highly heterogeneous demand patterns within their broader category
  • Barbados rum quantities peak only on one date - supply issue?

As an extension to the above analysis, I followed a great suggestion by Jay (Udacity reviewer) to group the data by broader liquor categories and season in order to more clearly identify the demand seasonal profile.

So after manually categorising the various liquor categories to 9 liquor types (brandies, gins, liqueurs, miscellaneous, rums, schnapps, tequila, vodkas, whiskies) I created the following informative plot…

Here I’ve manually categorised the seasonal profile of the various liquor types to “cold” (or “winter”) and “warm” (or “summer”). The y axis in this plot is basically an adjusted representation of the bottle sales that takes into account the overall bottle sales per category and also the database shortcomings (countering the overall big drop after June).

We get intuitive results here as gins, rums, vodkas and tequilas are all found to have boosted sales during the warmer season, whilst brandies, whiskies, liqueurs and schnapps sell more during cold weather.

As a final insight in this section, let’s look at time series correlations between categories.

In order to visualise this high number of correlations, I will produce a heatmap using levelmaps in the lattice package.

The diagonal is an axis of symmetry for the correlation matrix of course. So, by the overwhelming green colour of the heatmap one can understand that the majority of the correlation coefficients are positive and moderately strong (values between 0.2 and 0.6).

The stand-out outlier categories in the heatmap are “High Proof Beer” and “Peach Brandies”, as their correlation coefficients are in general very low for every category.

In the following heatmap we display the highest correlations (positive and negative) per category; it is meant to be read line-by-line in this case.

Unfortunately the heatmap function does not support legends, so in purple colour we see the highest positive and in light blue colour the lowest positive correlation coefficient.

So looking at these results, I was again happy to see that the highest correlated liquor with Triple sec is Tequila - Margarita cocktails are very popular in Iowa also it seems! For Tequila the highest correlation is imported cordials and liqueurs.

A quick look at the negative correlations now.

The peach brandies demand pattern is overwhelmingly the most commonly occurring highest negative correlation among the liquors, including Tequila.

It would be useful to plot tequila against its highest positively and negatively correlated categories and see what we get

Here we see that for the highest positive correlations there is indeed a clear linear trend. In the contrary, the highest negative correlations do not really display a strong linear trend.

Out of curiosity, let’s produce the relevant displays for 80 proof vodka.

Spiced rum, Puerto Rico rum and straight bourbon whiskies are the highest positive correlations here and one can see a clear linear trend with very high R-squared values. Coincidentally (?) the same three categories are the highest negative correlations. Whilst high proof beer is most definitely an outlier in terms of either data recording or actual frequency of orders, peach brandies and decanters appear probably uncorrelated altogether. To be fair there does appear to be some negative association for peach brandies, but if one looks at the highest negative correlations for all categories, these two appear extremely frequently - so they are genuine outliers in terms of order patterns.


Final Plots and Summary

Plot One

Description One

The plot above shows an overlay of two time-series demonstrating measures of bottle quantities ordered. Specifically:

  • The raw time-series of bottle quantities ordered
  • The adjusted time-series, where quantities are divided by zipcode population (per order)

It is clear that when the quantities were population-adjusted, a clear periodical pattern is observable. Quantities of bottles ordered per citizen always peak on Tuesdays (annotated by the overlay of dashed lines). Monday and Wednesday are pretty much the same; after Wednesday the quantities decline. This pattern does makes sense: Monday is the first day of the week and the stores calculate their weekly requirements by the end of the day. They then submit the order either at end of business or early Tuesday; this allows time for the bottles to be delivered prior to the weekend. They shouldn’t need to order too many bottles after the early week’s budgeting.

Plot Two

Description Two

This time-series of box plots shows the monthly number of orders submitted by liquor stores; the colour of the box plots shows the number of participating stores on that particular month.

It is evident that some data-recording issue takes place in June (perhaps relevant to fiscal years?), as orders reduce dramatically and do not recover later in the year.

This plot shows that it is best to separate the dataset into 2 periods (Jan-May, June-Dec) for a more robust time-series analysis.

Plot Three

Description Three

This composite plot shows relative quantities ordered for each of the main liquor types. At an elementary level we can categorise the liquors based on their seasonal demand profile as “cold (” or “warm” weather drinks (“winter” and “summer” drinks respectively).

We can clearly identify that certain liquor types have a demand curve that peak during the colder weather season, such as brandies, liqueurs, schnapps and whiskies. Given that these types of liquors have a heavier taste and are generally considered “warming”, their demand curve makes sense.

The other clear demand profile category are the “summer drinks”, as shown by the time series of gins, rums, tequilas and vodkas. Bottle quantities in this category are positively correlated with expected average temperatures and as opposed to the previous category, these are the “summer drinks”.

Whilst the above findings are not surprising they nonetheless provide a useful dataset insight.


Reflection

Due to time limitations this project as it was submitted is not thorough, in the sense that most insights are derived from superficial analyses. Most of my observations are intuitive, and I have not succeeded in revealing any hidden patterns within the time-frame (perhaps with the exception of the clear periodicity of the total population-adjusted bottle quantity orders) - I fell into many dead-end analysis attempts.

In general, I think that identifying the auxiliary datasets and merging them to the master took a longer time than I expected and this limited the amount of time I could dedicate to the actual analysis. However, I think that combining the Iowa Liquor Sales data with the auxiliaries has provided a large, granular and rich data set that enables anyone interested to play with the data and see what they find - I am sharing it on Dropbox.

This project gave me the opportunity to hone my skills in R data wrangling (packages data.table, dplyr are awesome) and data visualisation (ggplot2); not a lot was done in terms of statistical modelling though. I was also happy I got to use the stuff I learnt in a previous Udacity course in scraping tables and request data using API’s. Additionally, I started to use git and GitHub - the integration in RStudio is great.

A little disclaimer about the project.Rdata file I’ve added to the repo; I’ve dropped quite a few columns in order to reduce file-size and conform with GitHub rules - it allows faster knitting of the .rmd that would otherwise take ages. (For the full enriched dataset please see Dropbox link).